We are now entering the phase of doing serious statistics: we run regression models, inspect them, and plot and tabulate the results.
For this exercise, let’s pretend to be researchers interested in the determinants of the subjective risk of being treated for COVID-19 in a hospital. We suspect that age plays an important role, something we have already assessed by visualizing a similar relationship. So… we need the GESIS Panel COVID-19 survey data again:
library(dplyr)
library(haven)
library(sjlabelled)
gp_covid <-
  read_sav(
    "../data/ZA5667_v1-1-0.sav"
  ) %>%
  set_na(na = c(-1:-99, 97)) %>%
  mutate(
    likelihood_hospital = hzcy003a,
    age_cat = as.factor(age_cat)
  ) %>%
  remove_all_labels()
We start our analysis by checking the relationship with an ANOVA. We also include some essential covariates: gender, political orientation, and marital status.
serious_anova <-
  aov(
    likelihood_hospital ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid
  )

summary(serious_anova)
##                         Df Sum Sq Mean Sq F value Pr(>F)    
## age_cat                  9    346   38.40  23.868 <2e-16 ***
## sex                      1      1    0.99   0.617  0.432    
## political_orientation    1      0    0.11   0.071  0.789    
## marstat                  1      1    1.06   0.657  0.418    
## Residuals             3092   4974    1.61                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 660 observations deleted due to missingness
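The ANOVA only tells us that the group means differ somewhere across the age categories, not which pairs of groups differ. A post-hoc test such as `TukeyHSD()` would answer that. Here is a minimal sketch on R’s built-in `PlantGrowth` data as a stand-in; with the survey data, the call would be `TukeyHSD(serious_anova, which = "age_cat")`.

```r
# Post-hoc sketch: TukeyHSD() reports all pairwise mean differences with
# family-wise adjusted p-values and confidence intervals.
# (PlantGrowth is used here only as a stand-in dataset.)
fit <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(fit)
```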
Alright, there’s definitely something in the data we should investigate. It is often considered bad practice, but we could create a median split of the dependent variable likelihood_hospital and run a logistic regression with all variables. What do you see in the results? Hint: use dplyr::mutate() in combination with the ifelse() function.
gp_covid <-
  gp_covid %>%
  mutate(
    likelihood_hospital_cut =
      ifelse(
        likelihood_hospital > median(likelihood_hospital, na.rm = TRUE),
        1,
        0
      )
  )
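A quick sanity check of what the median split did never hurts. Here is a minimal sketch on a small toy vector; with the survey data, `table(gp_covid$likelihood_hospital_cut, useNA = "always")` does the same.

```r
# Median split sketch: values above the median become 1, the rest 0,
# and missing values stay NA (toy vector, not the survey data).
x <- c(1, 2, 3, 4, 5, NA)
x_cut <- ifelse(x > median(x, na.rm = TRUE), 1, 0)
table(x_cut, useNA = "always")
# median is 3, so we get three 0s, two 1s, and one NA
```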
serious_logistic_regression <-
  glm(
    likelihood_hospital_cut ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid,
    family = binomial(link = "logit")
  )

summary(serious_logistic_regression)
## 
## Call:
## glm(formula = likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat, family = binomial(link = "logit"), data = gp_covid)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.259  -1.008  -0.756   1.242   2.070  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.1525329  0.3987424  -5.398 6.73e-08 ***
## age_cat2               0.5570465  0.4001281   1.392  0.16387    
## age_cat3               0.2237200  0.4070216   0.550  0.58256    
## age_cat4               0.8552502  0.3851614   2.220  0.02638 *  
## age_cat5               1.0188140  0.3837725   2.655  0.00794 ** 
## age_cat6               1.2015271  0.3783196   3.176  0.00149 ** 
## age_cat7               1.5248309  0.3643355   4.185 2.85e-05 ***
## age_cat8               1.7367353  0.3739040   4.645 3.40e-06 ***
## age_cat9               1.9148385  0.3750669   5.105 3.30e-07 ***
## age_cat10              1.9574845  0.3732600   5.244 1.57e-07 ***
## sex                    0.0687836  0.0777806   0.884  0.37652    
## political_orientation  0.0247512  0.0206458   1.199  0.23059    
## marstat               -0.0007526  0.0481534  -0.016  0.98753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4067.1  on 3104  degrees of freedom
## Residual deviance: 3888.4  on 3092  degrees of freedom
##   (660 observations deleted due to missingness)
## AIC: 3914.4
## 
## Number of Fisher Scoring iterations: 4
# Interpretation:
# With increasing age, the probability of thinking it's likely to be
# hospitalized because of COVID-19 increases. But really?
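The raw logit coefficients are log odds ratios, so exponentiating them makes the age gradient easier to read. A minimal sketch on simulated data; with the survey data, the call would be `exp(coef(serious_logistic_regression))`.

```r
# Logit coefficients are log odds ratios; exp() turns them into odds
# ratios: values > 1 mean higher odds per unit increase in the predictor.
# (Simulated stand-in data, not the survey data.)
set.seed(123)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(-0.5 + 0.8 * d$x))
m <- glm(y ~ x, data = d, family = binomial(link = "logit"))
exp(coef(m))
```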
Running one model is not enough. It may be that the assumptions of logistic regression are not fulfilled, or a reviewer simply doesn’t like this type of regression. Instead, she proposes running a binomial regression with a cauchit link.
Hint: have a look at ?family to see how you can include a cauchit link.
serious_cauchit_regression <-
  glm(
    likelihood_hospital_cut ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid,
    family = binomial(link = "cauchit")
  )

summary(serious_cauchit_regression)
## 
## Call:
## glm(formula = likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat, family = binomial(link = "cauchit"), data = gp_covid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2523  -1.0064  -0.7601   1.2431   2.0485  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -2.548363   0.849671  -2.999  0.00271 **
## age_cat2               1.036283   0.875953   1.183  0.23680   
## age_cat3               0.502607   0.912892   0.551  0.58193   
## age_cat4               1.427898   0.852017   1.676  0.09376 . 
## age_cat5               1.605221   0.848333   1.892  0.05846 . 
## age_cat6               1.792792   0.843912   2.124  0.03364 * 
## age_cat7               2.072772   0.837850   2.474  0.01336 * 
## age_cat8               2.246052   0.840343   2.673  0.00752 **
## age_cat9               2.386031   0.840525   2.839  0.00453 **
## age_cat10              2.419927   0.840032   2.881  0.00397 **
## sex                    0.046430   0.069014   0.673  0.50109   
## political_orientation  0.017770   0.018113   0.981  0.32655   
## marstat               -0.004667   0.040372  -0.116  0.90796   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4067.1  on 3104  degrees of freedom
## Residual deviance: 3889.1  on 3092  degrees of freedom
##   (660 observations deleted due to missingness)
## AIC: 3915.1
## 
## Number of Fisher Scoring iterations: 5
# Interpretation of difference:
# I don't know...
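One way to make sense of the difference: the cauchit link is the quantile function of the standard Cauchy distribution, and its inverse has much heavier tails than the logistic. Comparing the two inverse links at a few values of the linear predictor shows that the cauchit curve approaches 0 and 1 far more slowly, which is why the coefficient scales differ between the two models.

```r
# Inverse link comparison: plogis() is the inverse logit, pcauchy() the
# inverse cauchit. Both cross 0.5 at eta = 0, but the cauchit curve
# moves towards 0 and 1 much more slowly in the tails.
eta <- c(-4, -2, 0, 2, 4)
round(cbind(eta, logit = plogis(eta), cauchit = pcauchy(eta)), 3)
```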
Different link functions are sometimes used because they can yield different model fits. That’s definitely something we should investigate as well.
Hint: use anova() with the argument test = "LRT" to perform a likelihood ratio test. What’s your interpretation?
anova(
  serious_logistic_regression,
  serious_cauchit_regression,
  test = "LRT"
)
## Analysis of Deviance Table
## 
## Model 1: likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat
## Model 2: likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      3092     3888.4                     
## 2      3092     3889.1  0 -0.72287
# Interpretation of difference:
# It seems there is no big difference. We can make the reviewer happy with the
# cauchit model and still keep our results.
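One caveat worth noting: the two models use the same predictors and differ only in the link, so they are not nested, which is why the likelihood ratio test reports Df = 0. For non-nested models, comparing AIC values is the more common route. A minimal sketch on simulated data; with the survey data, the call would be `AIC(serious_logistic_regression, serious_cauchit_regression)`.

```r
# AIC comparison of two non-nested binomial models that differ only in
# the link function; lower AIC indicates the better penalized fit.
# (Simulated stand-in data, not the survey data.)
set.seed(42)
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(0.5 * d$x))
m_logit   <- glm(y ~ x, data = d, family = binomial(link = "logit"))
m_cauchit <- glm(y ~ x, data = d, family = binomial(link = "cauchit"))
AIC(m_logit, m_cauchit)
```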